\section{Conclusions}
Modern systems present an overwhelming array of decisions to
application developers who wish to optimize the architecture 
configuration for their application.  Simultaneously, the
rise of multicore machines and parallel architectures has
further complicated this landscape, and the shift to a parallel
paradigm brings forth new issues pertaining to both performance
and correctness.  To facilitate application development in this
context, we have developed VAPP, a profiling, replay, and analysis tool that 
requires no hardware support and can assist in simulating
program execution on different architecture configurations and
evaluating parallel programs.  VAPP stresses transparency, and
all logged execution traces can be easily accessed and queried
by users.

We have demonstrated VAPP's ability to log detailed memory operations,
lock usage, actions of multiple threads, and function calls.  By
storing logs in a SQLite database, users can view all logged data,
query the log with standard SQL queries, and implement custom
low-level analysis with the queries.  The VAPP replay library
facilitates loading the database and simulating program execution
on the desired architecture.  For parallel programs executing on multiple
cores, VAPP can discover unsafe locking, lock usage, and memory
buffers accessed exclusively be a single thread.  We consider this
project to be a success and feel that we have satisfied our 100\%
goals as well as a one of our 125\% goals, lockset verification.

While our decision to pursue the goal of trace accessibility may
ultimately prevent VAPP from scaling to operate on real-world
applications due to the size of the log files and I/O costs, we nevertheless
believe this area of the profiler and replayer
design space was under-explored and our pursuit was worthwhile.
Furthermore, by taking a log-and-replay approach instead of
simulating the target architecture during application execution
as is done in some related tools, VAPP is ideal for developers
who intend to simulate one particular execution trace many
times on different architecture configurations.  Conversely,
VAPP is not as well-suited for applications that run on new
input with every modification to the target architecture.  It will
still function in such cases, but the overhead may make it a less
attractive than alternatives that perform simulation alongside
application execution.

One limitation of VAPP is inherent in
dynamic analysis tools for correctness in parallel programs and
as such cannot be easily overcome.  That is, a dynamic tool
can only detect concurrency bugs that occur during program
execution and cannot guarantee that failure to detect such
bugs means they will not arise under different conditions.
However, it has been shown that given a particular
execution of a program that uses multiple semaphores,
detecting races or determining all possible orderings that could
lead to an indistinguishable execution is NP-hard
\cite{netzer1990complexity}.  As such, we certainly do not consider
this limitation to be due to a weakness of VAPP.

Now the we have implemented the core VAPP framework and basic
replay and analysis functionality, there are many directions
for possible future work.  If VAPP is to become competitive
with related profiling and simulation tools, the size of the
log files when tracing medium to large scale programs must be
addressed.  Our control library to dynamically disable logging
during certain parts of execution is a step in the right direction,
but does not help when the programmer wishes to store and replay
a full trace.  In addition, because VAPP does not take a snapshot
of the current system state when the trace is turned off, it is
possible that replaying only those portions of the application
execution that were logged could result in simulations that differ
from the original execution due to discrepancies in the initial
system state.  Thus, more sophisticated techniques for reducing
the log size must be investigated, ideally without sacrificing
interpretability of the logs.
Previous approaches for reducing the amount
of data that must be logged, such as Netzer's transitive
reduction \cite{netzer1993optimal}, can likely be introduced to
VAPP or custom compression techniques could be developed.  In addition,
VAPP currently bases all of its analysis on a single logged
execution trace, but since the logs are saved in a database it is feasible
to construct more sophisticated forms of analysis that span multiple
executions.  This would enable programmers to study how features
of the underlying architecture affects performance in ways that
are dependent on the input data or control flow.

We believe that expanding VAPP to provide more rich feedback to
developers of parallel applications is the most exciting prospect
for future development.  VAPP is already able to analyze how
threads use shared memory, and a natural next step would be to
feed such information back to the hardware.  Set pinning in
shared caches in multiprocessors as well as Non-Uniform Memory Architecture
systems could benefit from such feedback by localizing data
to the processors that use it exclusively.  Another form of feedback
that could be very useful to the compiler or hardware would be calculating
lock contention statistics from the already tracked log usage patterns.
This would allow the system to use simple locks for very low contention
locks and more space-intensive or complex locks (e.g. queue-based 
or list-based locks) only as needed.  However, because VAPP analysis
is done offline, one would either need to assume that multithread
memory usage patterns do not significantly vary from execution to
execution or that trends could be generalized by running VAPP on
traces from multiple executions.

It would also be very interested to reunite our two Pin tools
so that users could replay parallel programs.  Deterministic
replay is an important area of research, and although VAPP does not
presently does not track parallel execution at the same level as
other (hardware-assisted) approaches \cite{narayansamy2005bugnet},
replaying execution based on the order in locks are acquired and
released would provide some form of deterministic replay.  As with
the aforementioned feedback mechanisms, the data needed for such a
feature is already logged and available, which suggests such an
extension would indeed be feasible.

\subsection{Work Distribution}
Work was split equally between all groups members, therefore each of
the groups members is to be credited for 50\% of the work.
